Current approaches to punctuation in computational linguistics
Authors
Abstract
Some recent studies in computational linguistics have aimed to take advantage of various cues presented by punctuation marks. This short survey is intended to summarise these research efforts and, additionally, to outline a current perspective for the usage and functions of punctuation marks. We conclude by presenting an information-based framework for punctuation, influenced by treatments of several related phenomena in computational linguistics.
Abbreviations: DRT discourse representation theory; DRS discourse representation structure; NLP natural language processing; NLG natural language generation; RST rhetorical structure theory; SDRT segmented discourse representation theory; SDRS segmented discourse representation structure
Similar resources
Presenting Punctuation
Until recently, punctuation has received very little attention in the linguistics and computational linguistics literature. Since the publication of Nunberg's (1990) monograph on the topic, however, punctuation has seen its stock begin to rise: spurred in part by Nunberg's ground-breaking work, a number of valuable inquiries have been subsequently undertaken, including Hovy and Arens (1991), Da...
Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations
The main task of tokenization is to divide the sentences of a text into their constituent units and to remove punctuation marks (periods, commas, etc.). Each unit is a continuous lexical or grammatical written string that forms an independent semantic unit. Tokenization occurs at the word level, and the extracted units can be used as input to other components such as a stemmer. The requirement to create...
Toward a punctuation checker for Basque
Until a few years ago, researchers in computational linguistics largely ignored punctuation. Since the publication of Nunberg's monograph [Nunberg G., 1990], however, work on punctuation has increased [Bayraktar M. et al., 1998] [Hardt D., 2001] [Pala K. et al., 2003], and punctuation has recently been used more and more for different Natural Language Processing tasks. Our research group has been working...
Punctuation as Implicit Annotations for Chinese Word Segmentation
Paragraphs are composed of sentences. Hence when a paragraph begins, a sentence must begin, and as a paragraph closes, some sentence must finish. This observation is the basis of the sentence boundary detection method proposed by Riley (1989). Similarly, sentences consist of words. As a sentence begins or ends there must be word boundaries. Inspired by this notion, we invent a method to learn a...
Pacific Association for Computational Linguistics APPLYING MACHINE LEARNING FOR HIGH PERFORMANCE NAMED-ENTITY EXTRACTION
This paper describes a machine learning approach to build an efficient, accurate and fast name spotting system. Finding names in free text is an important task in addressing real-world text-based applications. Most previous approaches have been based on carefully hand-crafted modules encoding linguistic knowledge specific to the language and document genre. Such approaches have two drawbacks: th...
Journal title:
- Computers and the Humanities
Volume 30, issue
Pages -
Publication date: 1996